ABSTRACT

A detailed description is given of the network connecting processors (PE's)
to memory modules (MM's) in the NYU Ultracomputer, and of the switches used
to realize this network.

The network is a packet switched multistage shuffle-exchange network. All
the switches are identical, and built of a small number of chips: two monolithic
integrated circuits and one small arbiter chip.

In addition to routing, these chips can queue information when the network
is congested. A third function is the combining of like requests to the same
memory address. This capability allows efficient concurrent access by many PE's
to the same location in memory, and effectively distributes the support of
synchronization primitives (Fetch&Add) over the entire network.

The logic of each individual switch is pipelined, and several systolic
structures are used to achieve a high throughput.


1.   Introduction

The purpose of this paper is to describe the "Ultraswitch" (USW), a custom designed VLSI
switching device which composes the processor to memory communication network in the NYU
Ultracomputer. While the technical description of the USW will be centered on the particular
design used for the NYU Ultracomputer, the basic design of this component has wider
applications. We assume the reader is familiar with the architecture of the NYU Ultracomputer as
described in [GGKMRS], and in particular with the design of the communication network and the
functions of the switches within this network.

A typical architecture for a MIMD, shared memory parallel computer consists of a large
number of autonomous processing elements (PE's) connected through a connection network
(CN) to a shared memory consisting of a large number of separate memory modules (MM's).
The connection network used in the NYU Ultracomputer has the following properties.

1.  It is message switched. Request messages are sent from the PE's and forwarded by
successive switches of the network until they reach the destination MM. Replies are routed back
in the same way.

2.  The communication network has the topology of a Delta-network [KS]. It connects N=n^d
PE's to M=m^d MM's through d stages of n x m switches*. Each MM is assigned an m-ary
address d digits long. The i'th digit of the MM number indicates how a message for that MM
is routed by a switch at the i'th stage of the network. In the present Ultracomputer design
n=m=2, and we shall for simplicity assume these values in the sequel. While the topology
of the network affects the routing logic of the USW, it does not affect the other aspects of
the USW structure.

3.  The reply to a message is routed along the same path on which the message was sent, in the
reverse direction. (If a Delta-network topology is used, this follows from the fact that there
exists a unique path from each input to each output.)

4.  All the requests sent to memory are of the form

FETCH&OP(addr,val),
where OP is an associative operation, addr is a memory address and val is a value. The
effect of this instruction is to indivisibly

a.  Return @addr, the value stored at address addr.

b.  Set @addr <- OP(@addr,val).

* Each switch is bidirectional, so that we have n input ports and m output ports in the PE to MM direction,
and m input ports and n output ports in the MM to PE direction.

Examples of such operations are

a.  LOAD: OP(a,b) = a (the value sent is ignored).

b.  STORE: OP(a,b) = b (the value returned is ignored).

c.  FETCH AND ADD: OP(a,b) = a + b.

d.  TEST AND SET: OP(a,b) = a or b.

The network hardware will support the first three types of operations*. This is not an
inherent limitation; the design could be trivially extended to support any predefined set of
the above operations.

2.  Routing Logic

The routing logic in a Delta-network is very simple: Each message starts with a header
containing the destination (MM) address. A switch uses the first bit (digit) of this header to select
the output port where the message is forwarded, and then discards it. Thus the successive digits
of the MM number are used by switches at the successive stages of the network. The return
address is dynamically created as the message is forwarded through the network. Each switch
appends to the end of the routing header a bit (digit) identifying the port where the message was
received. Thus when the message leaves the network it contains a complete return address that
can be used to route the reply through the forward path traversed by the message, but now in the
reverse direction. Note that this scheme for creating a return address can be applied to any
network. At each switch, the amalgamation of a partial forward address and a partial return address
contained in the routing header uniquely identifies the sender and the receiver of the message.
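
The header manipulation described above can be illustrated with a minimal sketch for n=m=2. The function name route_request and the wiring assumption (that the input port used at stage i is bit i of the PE number, most significant bit first) are ours and are not taken from the USW design; the sketch only shows how the forward address is consumed while the return address is built.

```python
def route_request(pe: int, mm: int, d: int):
    """Trace one request through d stages of 2x2 switches.

    Returns (output port chosen at each stage, header on leaving the network).
    """
    # Header on entry: the d-bit MM number, most significant bit first.
    header = [(mm >> (d - 1 - i)) & 1 for i in range(d)]
    # Assumption (hypothetical wiring): the input port used at stage i
    # is bit i of the PE number, most significant bit first.
    in_ports = [(pe >> (d - 1 - i)) & 1 for i in range(d)]

    out_ports = []
    for stage in range(d):
        out_ports.append(header.pop(0))  # first bit selects the output port, then is discarded
        header.append(in_ports[stage])   # bit identifying the receiving port is appended
    return out_ports, header


out_ports, return_address = route_request(pe=5, mm=3, d=3)
print(out_ports)       # the successive bits of the MM number
print(return_address)  # the bits of the PE number, i.e. the return address
```

At every intermediate stage the header holds the unused suffix of the MM number followed by the prefix of the PE number collected so far, which is the amalgamated address mentioned above.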

This nonadaptive routing scheme does not prevent conflicts between messages. To handle
conflicts one or more queues are associated with each output port of a switch**. Messages going
from a given PE to a given MM are processed in order of arrival, so that replies to requests issued
by a PE to a given MM arrive in the order the requests were issued. Note however that requests
sent to different MM's need not be processed in the order they were issued. A handshaking
protocol between adjacent switches is used to prevent overflow. Since a PE is always willing to accept
a reply to an outstanding request, deadlock cannot occur in the network.

* We also support STOREs of half and quarter words.

** There is effectively one queue for each input port - output port pair in the switch.

3.   Combining Messages

Messages destined for the same memory location, and having the same operation, are
combined if they meet at a switch. Only one, combined request is forwarded to the MM. Sufficient
information is also stored at the switch, enabling the "decombining" of the returning message into
its two constituent messages. When the reply to this combined request comes back to the switch
where the combining occurred it is split into two distinct replies to the two original requests.
Suppose that two messages corresponding to the operations FETCH&OP(addr,val1) and
FETCH&OP(addr,val2) are sent by two distinct PE's and conflict at the same switch. The switch
forwards the combined request

FETCH&OP(addr, OP(val1,val2))
and stores the value val1. Assuming no further combinations occur, the content of location addr
is updated to contain

OP(@addr, OP(val1,val2)) = OP(OP(@addr,val1),val2).
The old value @addr is sent back. When the reply containing this value reaches the switch where
the combining occurred the value @addr is sent back to satisfy the first request, and the value
OP(@addr,val1) is sent back to satisfy the second request. Note that the final effect of this is as
if the two requests were serially satisfied by the MM. This serialization principle is an important
paradigm in reasoning about the non-determinism of the network. The same holds true if
combined requests are further combined at subsequent stages of the network.
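
The serialization principle can be checked with a small software model. The sketch below is only an illustration under our own naming (fetch_and_op, combined, serial are hypothetical helpers, and memory is modeled as a Python dict); it compares the replies and the final memory content produced by one combined request with those produced by two requests served one after the other.

```python
def fetch_and_op(memory, addr, val, op):
    """Indivisibly return the old value at addr and store op(old, val)."""
    old = memory[addr]
    memory[addr] = op(old, val)
    return old


def combined(memory, addr, val1, val2, op):
    """Two requests meet at a switch: one request is forwarded, the reply is split."""
    saved = val1                                          # the switch stores val1
    old = fetch_and_op(memory, addr, op(val1, val2), op)  # single combined request
    return old, op(old, saved)                            # replies to the first and second request


def serial(memory, addr, val1, val2, op):
    """The same two requests served one after the other by the MM."""
    first = fetch_and_op(memory, addr, val1, op)
    second = fetch_and_op(memory, addr, val2, op)
    return first, second


OPS = {
    "LOAD": lambda a, b: a,
    "STORE": lambda a, b: b,
    "FETCH AND ADD": lambda a, b: a + b,
}

for name, op in OPS.items():
    mem_c, mem_s = {8: 10}, {8: 10}
    replies_c = combined(mem_c, 8, 3, 4, op)
    replies_s = serial(mem_s, 8, 3, 4, op)
    print(name, replies_c == replies_s and mem_c == mem_s)
```

The equality relies on OP being associative, which is the condition stated for FETCH&OP requests.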

We now specialize this general scheme to our framework:
LOAD - LOAD: One load is executed and the loaded value is forwarded to both PE's.

STORE - STORE: The new* store takes effect, while the old one is ignored. The
acknowledgment to the STORE is forwarded to both PE's**. Although stores can be performed on partial
words, we currently support only merging of STORE instructions affecting the same part of the
word.

FETCH&ADD - FETCH&ADD: The sum of the values forwarded is sent to the memory, and the
value of the old (first) request is stored at the switch. The returning reply is forwarded to
satisfy the first request, and the sum of the returned value and the stored value is forwarded to
satisfy the second request.

It is actually possible to merge any two requests destined for the same memory location.
Our present design, however, supports only the combination of requests with the same operation.
It is also possible to support combining of more than two messages at a switch. Our present
design supports only combining of pairs at each switch.

It must be possible to identify uniquely the two requests that were combined when the reply
to that combined request arrives back at the switch. We shall assume that no PE has more than
one outstanding request of the same type to the same memory location. Thus at each switch, the
address field of a message, which consists of the amalgamated PE/MM number, the address within
the MM, and the opcode, provides a unique identification of the message. The message obtained by
combining two messages is sent with the address field of the first combined message, and the
address fields of both combined messages are stored at the switch***.

* While the temporal ordering of combined requests is immaterial, the hardware will ensure that if request A
is combined with request B and A arrived before B at the switch, then the effect of the combining is as if A
was executed before B.

** Note that the ignored STORE should not be acknowledged before an acknowledgment to the executed
STORE returns from memory.

*** Actually, most of the information in these two address fields is identical and need not be replicated.

4.   Ultraswitch structure

The structure of the USW is schematically described in Figure 1.

[Figure 1. Schematic of the USW; recoverable labels include Combine_Queue and Wait_Buffer.]


It consists essentially of two 2x2 routing devices, each transmitting messages from its input ports
to the appropriate output port on the other side. The PE_to_MM device routes requests received
at the FromPEi input ports to the ToMMj output ports, whereas the MM_to_PE device routes
replies from the FromMMj input ports to the ToPEi output ports. The Combine_Queue associated
with each ToMM output port fulfills three functions: it is a FIFO buffer for requests waiting to be
transmitted to the next stage; it performs the switching function; it also contains the combine logic
required to combine messages with identical destinations. The Wait_Buffer associated with the
FromMMj input port stores sufficient information about a combined request sent through the
ToMMj output port, so that upon return of the answer to a combined message, it can detect and
decombine the request into its constituent parts.

Each message received at a FromPE port contains the following four fields:

1.  PD, the path descriptor (amalgamated PE/MM number);

2.  IADD, the internal MM address;

3.  OPC, the operation opcode;

4.  DATA, the data item.

The path descriptor has the form pm, where p is a prefix of the number of the sending PE,
and m is a suffix of the number of the receiving MM, and |p| = i-1 for switches in the i-th stage of
the network. If the opcode identifies a load operation then the data item is empty.

When a request of the form <pm IADD OPC DATA> is accepted at a FromPE port the
path descriptor is rotated and modified, as described in Section 2, and the request is routed
to the Combine_Queue on the side where it will be output. If no message of the form <p'm IADD
OPC DATA'>, with matching address, internal address, and opcode is present in the queue, then
the new message joins the end of the queue. If such a message is already in the queue, then the
new message is combined with the old one. To effect the serialization of the old message
followed by the new message, the USW performs the following actions: If the request is a STORE then
the old DATA is replaced by the new DATA'; if the request is a FETCH&ADD then the old
DATA is replaced by DATA+DATA'. In addition, the old message is stored in the appropriate
Wait_Buffer, together with p'. Thus, the item stored in the Wait_Buffer is of the form <p' pm
IADD OPC DATA>. The DATA field has to be stored in the Wait_Buffer only for
FETCH&ADD requests.

We assume that replies coming back from the MM's have the same format as the requests
sent from the PE's, the only difference being that replies to STORE requests have an empty data
field, whereas replies to LOAD requests now have a nonempty data field. When a reply of the
form <pm IADD OPC DATA> is received at a FromMM port, it is simultaneously routed to the
appropriate ToPE port and matched against entries in the Wait_Buffer associated with that
FromMM port. If a matching entry of the form <p' pm IADD OPC DATA'> is found, then
this entry is deleted and a new reply is created, which is of the form <p'm IADD OPC DATA''>,
where DATA'' is empty if the request is a STORE, DATA'' = DATA if the request is a LOAD, and
DATA'' = DATA+DATA' if the request is a FETCH AND ADD.
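
The request and reply handling just described can be summarized in a short software model. This is a sketch under several assumptions of ours: the names Msg and SwitchSide are hypothetical, the rotation of the path descriptor is omitted, the queue is searched linearly (which says nothing about the hardware structure), and a queue entry that has already been combined once is not combined again.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Msg:
    p: str                      # prefix of the sending PE number
    m: str                      # suffix of the destination MM number
    iadd: int                   # internal MM address (IADD)
    opc: str                    # "LOAD", "STORE" or "FETCH&ADD"
    data: Optional[int] = None  # empty for LOAD requests and STORE replies
    combined: bool = False      # model-only flag: entry already holds a pair


class SwitchSide:
    """One Combine_Queue together with its Wait_Buffer (software model only)."""

    def __init__(self):
        self.combine_queue = []
        self.wait_buffer = []

    def accept_request(self, new: Msg) -> None:
        for old in self.combine_queue:
            same = (old.m, old.iadd, old.opc) == (new.m, new.iadd, new.opc)
            if same and not old.combined:
                # Store <p' pm IADD OPC DATA> before the old DATA is modified.
                self.wait_buffer.append(
                    (new.p, old.p, old.m, old.iadd, old.opc, old.data))
                if old.opc == "STORE":
                    old.data = new.data
                elif old.opc == "FETCH&ADD":
                    old.data = old.data + new.data
                old.combined = True
                return
        self.combine_queue.append(new)

    def forward_request(self) -> Msg:
        return self.combine_queue.pop(0)

    def accept_reply(self, reply: Msg) -> list:
        replies = [reply]  # the reply itself is always routed on
        for i, (p_new, p_old, m, iadd, opc, saved) in enumerate(self.wait_buffer):
            if (p_old, m, iadd, opc) == (reply.p, reply.m, reply.iadd, reply.opc):
                del self.wait_buffer[i]
                if opc == "STORE":
                    data = None
                elif opc == "LOAD":
                    data = reply.data
                else:  # FETCH&ADD
                    data = reply.data + saved
                replies.append(Msg(p_new, m, iadd, opc, data))
                break
        return replies
```

For example, two FETCH&ADD requests with values 3 and 4 to the same address leave the switch as a single request carrying 7; if the memory returns the old value 10, the two replies carry 10 and 13.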

The structure described above has one drawback: If each input port receives one message
then the two messages will conflict at the same output port with probability one half. Thus, if the
maximal throughput of each port is one message per time unit, the average throughput of each port
will be less than 3/4.
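
The figure of 3/4 can be reproduced by enumerating the equally likely destination choices for a single cycle. The sketch below assumes uniformly random, independent destinations and counts only one cycle in isolation (the function name is ours); it therefore yields the bound itself, and sustained throughput with blocked messages would presumably fall below it.

```python
from fractions import Fraction
from itertools import product


def one_shot_throughput(ports: int = 2) -> Fraction:
    """Average number of messages passed per output port in one cycle,
    when every input port holds one message and each output port can
    pass only one of the messages directed to it."""
    outcomes = list(product(range(ports), repeat=ports))
    delivered = sum(len(set(dests)) for dests in outcomes)  # one per distinct output
    return Fraction(delivered, len(outcomes) * ports)


print(one_shot_throughput(2))  # 3/4
```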

The performance of the switch can be enhanced by adding one queue to each of the four
input-output paths on the switch (i.e. associating two queues with each input port, or associating
two queues with each output port). This design has the critical flaw of containing, under most
circumstances, only one message in the Combine_Queue, thus drastically decreasing the
opportunities for combination. There would be multiple messages in the simple queue; however, these are
unable to combine.

This rather serious drawback can be eliminated by replacing each of the four simple queues
on the switch by combining queues. Each of these is fed from a unique input port, and feeds a
unique output port. Arbitration logic is used to arbitrate between the two queues connected to the
same output port. In this design no combining can occur at the first stage, so that one stage of
combining is lost (unless a different structure is used there). The design can be improved by
merging the two combining queues associated with the same output into one structure. The
resulting structure has the same interface to the outside world as the previous structure.

On the return path, four FIFO buffers are associated with each ToPE port, fed from the two
FromMM input ports and the two Wait_Buffers connected to the port. Again, these FIFO buffers
can be merged into one structure.

This configuration achieves the following two design goals:

1.  A switch is always willing to accept a new message on each input port, as long as the
corresponding queues are not full. The probability of full queues can be reduced to any
desired level by increasing the queue size (as long as the switch utilization is below one).

2.  As long as buffers are not full, a message directed to a given output port will not be delayed
by messages directed to other output ports. Again, the probability of full buffers can be
reduced to any desired level by increasing the buffer sizes.

The arbitration logic in such a device may be kept very simple, so that arbitration time is
insignificant (arbitration for the next cycle may also be overlapped with transmission at the current
cycle). This device is therefore "non blocking" (see [DJ]).

The present design of the Combine_Queue supports only the combination of pairs of
requests at each switch. If a new request matches the combination of two previous requests in the
Combine_Queue, no new combination occurs. While this restriction reduces the number of request
combinations that will occur, it greatly simplifies the design of the USW: The need for an
associative search in the Combine_Queue is avoided, and the associative search in the Wait_Buffer is
guaranteed to yield at most one match.

4.1.   Pipelining 

The NYU Ultracomputer prototype will contain up to 64 32-bit processors connected to 64
memory modules, each containing 1/2 a megabyte of memory. Each data item contains 32 bits,
and 30 additional bits are required for the path descriptor, the internal address, the opcode and
for error detection. A switch may be receiving and transmitting at each cycle up to 8 messages.
These numbers are likely to increase significantly for a real machine. A fully parallel version
would need to support the transfer of 64x8 = 512 data bits, and some 30 additional control signals
on or off chip at each cycle. While the gate count of the circuits required to implement a USW
does not preclude a one-chip implementation, the main impediment to such integration is the high
pin count required.

Two possible solutions can be applied. In both, each message is split into several slices or
packets. One can use a bit slice implementation of the USW, so that different components are
handling the different packets of one message. The transmission of messages is then
"space-multiplexed". On the other hand, one can time-multiplex the transmission of the successive
packets of a message to the same component.

"Space-multiplexing" enables a higher bandwidth than time-multiplexing, at the expense of
increased logic. Note however that a large amount of "horizontal" communication and
coordination must take place between the different components of a switch, as routing decisions and
combining decisions have a global effect. This is likely to further increase the complexity of such an
implementation and to slow down the switch cycle. For MOS technologies, the off chip delays
impose an especially harsh penalty (overhead), as the on-chip logic can be quite fast. Also, such an
implementation is less flexible as the width of the data paths is hard-wired. In particular, no gain
is accrued from the fact that some messages are shorter, and contain an empty data field.

If time-multiplexing is used, several cycles are now required to transmit a message. We shall
show however that the internal logic of the USW can be pipelined: Messages do not have to be
assembled at each switch, but rather can be handled on a per packet basis. Thus, when queues
are empty there is only one cycle delay per switch for a request. The time-multiplexing adds an
additive term to the delay rather than a multiplicative factor. Note however, that queuing delays
increase multiplicatively with the multiplexing factor, so that the performance of the network
under heavy load may be seriously impaired (see [KS] for a more detailed analysis).
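
The difference between an additive and a multiplicative cost can be made concrete with a rough delay model for an unloaded network. The two formulas below are our reading of the paragraph above, not figures from the design: we assume one cycle per switch for the first packet, one further cycle for each remaining packet, and, for comparison, a hypothetical switch that reassembles the whole message before forwarding it. The packet count of 4 is an arbitrary illustration.

```python
def pipelined_delay(stages: int, packets: int) -> int:
    """Cycles until the last packet leaves the network when each switch
    forwards packets as they arrive (empty queues assumed)."""
    return stages + (packets - 1)


def reassembling_delay(stages: int, packets: int) -> int:
    """Cycles when every switch waits for the whole message first
    (hypothetical comparison case, empty queues assumed)."""
    return stages * packets


# 64 PE's and 2x2 switches give 6 stages; 4 packets per message is assumed.
print(pipelined_delay(6, 4))     # 9
print(reassembling_delay(6, 4))  # 24
```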

The present design for the USW is for a time-multiplexed switch with pipelined logic. The
performance of such a switch implemented in nMOS logic seems to be adequate for the prototype.
Moreover, as we shall see, the design is parametrized, and can accommodate with only minor
modifications different numbers of packets per message, and different sizes for each packet. In
particular, it will be possible to take advantage of packages with increased pin count with no
essential change in the system design.

4.2.   External  Protocols 

The internal structure of the USW is simplified if one assumes that message transmission
occurs at consecutive cycles (i.e. the transmission of a message cannot be halted in the middle).
As we shall see in the sequel, it is also advantageous to start message transmission only at even
cycles. The interswitch communication protocol is designed to achieve these two goals. It is a
message level protocol, not a packet level protocol.

A cycle is even for a given switch if the parity of this cycle is the same as the parity of the
stage to which the switch belongs (i.e. cycles which are even for a switch are odd for its
predecessors and successors). The reception of a message may start only at an even cycle; the
transmission of a message may start only at an odd cycle. A switch is willing to accept a new message
only if the available space in the corresponding buffers guarantees that it will be able to receive the
entire message. This requirement allows more signal overlap.

The following protocol is used for message transfer. With each set of data lines are
associated two control signals: A sender asserts the Data_Valid line when it wishes to initiate a message
transmission. Independently, a receiver asserts the Data_Accept line when it is ready to accept a
new message. A message transfer may therefore start only if both Data_Valid and Data_Accept
are set, and the cycle parity is correct. The control signals are ignored at cycles where a message
transfer cannot be started, thus enabling these signals to be set ahead of time. Note that this is not,
strictly speaking, a handshaking protocol: Data_Accept is not an answer to Data_Valid, nor an
acknowledgment for reception, but is issued independently and simultaneously. The sender
asserts the data on the data lines whenever it is ready to send (i.e. whenever Data_Valid is set).
If it receives the Data_Accept signal, it assumes the data has been accepted and proceeds with the
next packet. No provision for retry is made.
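
The start condition of a message transfer can be written down directly from the two preceding paragraphs. The sketch below is only a restatement in code; the function names and the representation of stages and cycles as integers are our assumptions.

```python
def is_even_for(cycle: int, stage: int) -> bool:
    """A cycle is even for a switch when its parity equals the parity of the switch's stage."""
    return cycle % 2 == stage % 2


def transfer_starts(cycle: int, receiver_stage: int,
                    data_valid: bool, data_accept: bool) -> bool:
    """A transfer starts only if both control lines are set and the cycle
    is even for the receiver (hence odd for the sender one stage earlier).
    At all other cycles the control lines are simply ignored."""
    return data_valid and data_accept and is_even_for(cycle, receiver_stage)


print(transfer_starts(4, 2, True, True))   # True
print(transfer_starts(5, 2, True, True))   # False: wrong parity, signals ignored
print(transfer_starts(4, 2, True, False))  # False: receiver not ready
```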

If the return path of the USW resides on a separate chip the same protocol is used for
transmission from the Combine_Queue to the Wait_Buffer. There are two differences, however.

1.  The first packet of a message sent from the Combine_Queue to the Wait_Buffer is received
at an odd cycle.

2.  The transmission of a message may be cancelled. (This holds true for the more complex
scheme described for the Combine_Queue at the end of sec. 4.3.1: when a message starts
leaving the combine queue it might not yet be known whether it matches its mate.) We use
a high Data_Valid signal at the cycle where the last address packet is sent for confirmation.

4.3.   Systolic queues

A basic component of the USW is the FIFO buffer. Guibas and Liang present in [GL] a
VLSI implementation of a FIFO buffer where an insertion or deletion can be performed every
four cycles, and where no global control signals are used, short of the two clock signals used by
the two-phase logic. We use in the USW a modified version of this structure, where insertions
and deletions can essentially be made at each cycle. To achieve this goal we must resort to an
increased number of global control signals. The queue consists of two columns of shift registers,
the IN column and the OUT column, as described in Figure 2.

[Figure 2.]


In the normal operating mode of the queue one packet is deleted at each cycle, provided that
the queue is not empty. Also, one packet may be inserted at each cycle. The number of cycles
with no insertions between two consecutive insertions must be even (this includes a zero length
interval, that is packets inserted at consecutive cycles). New packets are inserted at the bottom of
the IN column and shift up one position each cycle. Similarly, packets shift down one position on
the OUT column at each cycle, with the bottommost packet being deleted. However, if a packet
in the IN column faces a slot on the OUT column that will be empty at the next cycle, then it shifts
to that slot.

Let I_j be the packet in slot j of the IN column, O_j be the packet in slot j of the OUT
column, and emptyI_j, emptyO_j be flags which are set to TRUE when the corresponding slot is
empty. Our clocking scheme is the standard two phase (phi1, phi2) scheme described in Mead
and Conway [MC]. The phase 1 transitions of the clock are just the vertical movements described
above, that is

1.  I'_j := I_{j-1};

2.  O'_j := O_{j+1};

The phase 2 transitions of this queue at each cycle are described by the following equations:

1.  I'_j := I_j;

2.  O'_j := if emptyO_j
            then I_j
            else O_j;

3.  emptyI'_j := emptyI_j or emptyO_j;

4.  emptyO'_j := emptyO_j and emptyI_j.

Initially all the cells are empty. The boundary cases are handled by postulating a
permanently empty cell at the end of the OUT column; at the end of the first phase I_0 is either
empty, or contains a newly inserted packet; at the end of the second phase O_0 contains the deleted
packet. Note that if the queue is empty then a newly inserted packet in I_0 moves directly to O_0,
thereby bypassing the queue.
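
The two transition phases can be mimicked in software to see the bypass behavior on an empty queue. The sketch below is a simplified model under our own assumptions: an empty slot is represented by None instead of a separate flag, the class name SystolicQueue is hypothetical, and the blocked and full states as well as the spacing restriction on insertions are not modeled, so it should not be read as a complete FIFO.

```python
EMPTY = None  # assumption: None stands for a slot whose empty flag is TRUE


class SystolicQueue:
    def __init__(self, depth: int):
        self.I = [EMPTY] * depth  # IN column, slot 0 at the bottom
        self.O = [EMPTY] * depth  # OUT column, slot 0 at the bottom

    def phase1(self, new_packet=EMPTY) -> None:
        # IN column shifts up; I_0 receives the new packet or stays empty.
        self.I = [new_packet] + self.I[:-1]
        # OUT column shifts down; a permanently empty cell sits beyond the top.
        self.O = self.O[1:] + [EMPTY]

    def phase2(self) -> None:
        # A packet facing an empty OUT slot moves across; otherwise nothing changes.
        for j in range(len(self.I)):
            if self.O[j] is EMPTY and self.I[j] is not EMPTY:
                self.O[j], self.I[j] = self.I[j], EMPTY

    def cycle(self, new_packet=EMPTY):
        self.phase1(new_packet)
        self.phase2()
        return self.O[0]  # at the end of phase 2, O_0 holds the deleted packet


q = SystolicQueue(depth=4)
print(q.cycle("A"))  # A passes straight from I_0 to O_0
print(q.cycle("B"))
print(q.cycle())     # nothing inserted, nothing deleted
```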

The correctness of this scheme follows from the following invariant assertions.

(i)  The full slots in the OUT column occupy an initial segment of consecutive locations on that
column.

(ii)  The difference between the largest index of a full location on the OUT column and the largest
index of a full location on the IN column is nonnegative and even.

Two anomalous situations may occur:

1.  No deletions are performed (the buffer is blocked). Then no movement occurs on the OUT
column. The relative movement of the IN column with respect to the OUT column that
occurs on each normal cycle can be obtained in two cycles in the blocked state. The IN
column will shift up at the first phase of each of these cycles, but a transfer from the IN
column to the OUT column may occur only at the second phase of each second cycle. On
the second phase of odd, "illegal" cycles of a blocked period all transitions are identity
transitions. On the second phase of even cycles of a blocked period a normal transition is
performed, as in the unblocked case.

It  is  easy  to  see  that  the  two  invariant  assertions  are  fulfilled  at  the  end  of  even  cycles  when 
the  buffer  is  blocked.  The  control  logic  ensures  that  the  buffer  is  blocked  for  an  even  number  of 
cycles. 

2.  The buffer is full: A buffer is declared full when it has been blocked for an even number of
cycles, and fewer than two empty slots are left on the IN column. When the buffer is full it
does not change state, so that all the transitions are identity transitions.

The structure described above can be implemented with five global control lines (clock
signals included). Furthermore, these global signals can be predicted one cycle in advance, thus
allowing the overlapping of processing with signal distribution; see the Bellmac-32A design [Sh].
The logic required to generate these signals from two phase clock signals is simple. If message
transmissions are initiated by the PE's only at even cycles then at each switch transitions which are
associated with messages will occur at even time intervals. In particular, the restrictions on the
number of cycles between two consecutive insertions, and the number of cycles a buffer is blocked,
are automatically fulfilled, and do not impair the buffer performance. Each register cell is
connected to four other cells, one above and one below on the same column, and two from the
opposite column. The number of connections can be reduced to three if one is willing to
accept insertions and deletions every two cycles, by having the two columns move at alternate
cycles.

It is possible to merge several systolic queues with the same output into one combined
structure. Such a queue will have multiple IN columns, one for each input, but a unique OUT column.
Extra arbitration logic is required to arbitrate conflicting transfers to the same slot on the OUT
column.

Consider a systolic queue with two IN columns and one OUT column. One priority bit is
needed at each slot to indicate which of the two IN columns has priority in transfers to the OUT
column. However, if one of the IN columns has initiated a transfer to the OUT column, this
transfer must be pursued for all the consecutive packets of a message. This requires extra
information to be available in the queue: message formatting signals that indicate the start and the end of
a message, and a signal indicating an ongoing transfer from the IN column to the OUT column.
Arbitration between the IN columns occurs only at the first packet of a message. The transfer flag is
set as the result of this arbitration, and cancelled when the last packet of the message is
transferred. A more detailed description of a similar scheme is given in the next section.
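The per-slot arbitration just described can be sketched as follows. This is an illustrative model only: the class name `SlotArbiter`, its fields, and in particular the rule that the priority bit alternates after each arbitration are assumptions made for the example, since the text does not say how the priority bit is updated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotArbiter:
    """Arbitration state at one slot of a queue with two IN columns (hypothetical model)."""
    priority: int = 0                      # IN column (0 or 1) that wins a tie
    transfer_from: Optional[int] = None    # IN column whose message is in transit

    def select(self, requests, is_start, is_end):
        """Return the IN column allowed to write the OUT slot this cycle, or None.

        requests[i] -- column i presents a packet at this slot
        is_start[i] -- that packet is the first packet of a message
        is_end[i]   -- that packet is the last packet of a message
        """
        if self.transfer_from is not None:
            # An ongoing message keeps the slot until its last packet.
            winner = self.transfer_from
            if not requests[winner]:
                return None
        else:
            # Arbitration happens only on the first packet of a message.
            starters = [i for i in (0, 1) if requests[i] and is_start[i]]
            if not starters:
                return None
            winner = self.priority if self.priority in starters else starters[0]
            self.transfer_from = winner
            self.priority = 1 - winner     # assumed policy: alternate priority
        if is_end[winner]:
            self.transfer_from = None      # transfer flag cancelled at message end
        return winner
```

With both columns presenting a start packet, column 0 wins first and keeps the slot until its last packet; the next arbitration then favours column 1.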

4.3.1.  Combine queue

The simple queue described in the preceding section stores the information in the form of
packets until they are dequeued. There is no notion of data structuring other than single packets
and their ordering. We will show how to extend this notion of queue to include considerably
more processing power by the addition of more sophisticated control logic. The resulting
Combine_Queue not only stores messages but is also able to detect and segregate a message
which is destined for the same address as another message already in the queue. Such messages
(if they also contain the same Op) are candidates for combining, and if they are chosen to combine,
only one of the two messages (in the actual implementation we will only combine pairs) will be
forwarded to the MM. Note that the serialization principle does not require combination. The other
message is held at the switch in the Wait_Buffer until the response returns from the Memory
Module. It is here that the assumption that the return path of a reply is identical with the forward
path of the request is required.

As in the case of the simple queue, the initial segment of the OUT column contains 0 or
more consecutive non-empty packets. A message which enters the queue is either combined with
some packet in the initial segment of the OUT column, or, if no candidate is available, it will be
appended to the end of the OUT column. Thus, in the absence of combinations the
Combine_Queue acts as a simple queue. The message movements within the Combine_Queue are
described below.

As a new message moves up the IN column of a FIFO buffer it passes by all the messages
that were in the buffer at the time of its arrival. By adding extra comparison logic at each packet
level in the queue, it is possible to compare the address of the new message against that of each
succeeding old message, to spot candidates for combination. Messages which match and are to be
combined are removed from the IN column, preventing them from being appended to the end of the
OUT column. Such messages must be associated with the matched one in the OUT column so
that the combination can be performed on their respective data fields. To segregate the matching
message, a third column is required, which is dubbed the CHUTE. A message in the IN column
that matches a message in the OUT column is transferred to the CHUTE column. A message
moves down the CHUTE at the same rate as the corresponding message on the OUT column, in
tandem with the message against which it matched. A pair of messages to be combined will therefore
exit at the same cycle, the old one from the OUT column and the new one from the CHUTE
column, thus entering the Combine ALU simultaneously. An old message can be matched with at
most one new message, after which the CHUTE is full and will not accept a second message.
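The message-level behaviour described above (match against the oldest eligible message, otherwise append, at most one CHUTE partner per old message) can be sketched at a purely behavioural level. This is not a cycle-accurate model of the three columns; the class name `CombineQueueModel` and the dictionary layout of a message (`addr`, `op`, `data`) are assumptions made for the example.

```python
from collections import deque


class CombineQueueModel:
    """Behavioural sketch of the Combine_Queue (hypothetical, not cycle accurate)."""

    def __init__(self):
        # Each entry is [old_message, chute_partner_or_None]; the head leaves first.
        self.out = deque()

    def insert(self, msg):
        """Insert a message given as a dict with 'addr', 'op' and 'data' keys."""
        # The new message passes the old messages starting from the head of the queue.
        for entry in self.out:
            old, partner = entry
            if partner is None and old["addr"] == msg["addr"] and old["op"] == msg["op"]:
                entry[1] = msg            # segregated into the CHUTE beside `old`
                return "chute"
        self.out.append([msg, None])      # no candidate: appended to the OUT column
        return "out"

    def remove(self):
        """Pop the head of OUT together with its CHUTE partner (or None)."""
        return tuple(self.out.popleft())
```

A third request to the same address is not combined with a pair that is already formed; it is appended and becomes a candidate for the next matching arrival.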

The overall structure of the Combine_Queue is illustrated in Figure 3.

Figure 3.


We first describe the logic of the Combine_Queue under the assumption that each message
occupies two packets. The first packet contains the address, consisting of the routing information, the
internal MM address and the opcode. The second packet contains the data. The two successive
packets of a message are always received on consecutive cycles.

The movement of data on the IN and OUT columns, and the transfers from the IN column
to the OUT column, follow the pattern described above for the FIFO buffer.



We say that a packet in the IN column meets a packet from the OUT column when they are
both  at  the  same  horizontal  level  in  the  queue.  In  the  normal  state,  when  items  in  both  the  IN 
and  OUT  columns  move  one  position  each  cycle,  an  item  on  the  IN  column  meets  only  half  of  the 
items  moving  on  the  OUT  column.  We  have  previously  stipulated  that  both  messages  and  the 
interval  between  messages  have  even  length.  This  guarantees  that  the  i-th  packet  of  a  message 
moving  up  the  IN  column  meets  the  i-th  packet  of  each  message  moving  on  the  OUT  column. 
Thus  each  pair  of  packets  that  have  to  be  compared  meet  (this  holds  true  even  if  messages  have 
length  larger  than  two). 
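The parity argument can be checked with a small simulation. The sketch below is an assumption-laden illustration: levels are numbered upward from 0, an IN packet rises one level per cycle, an OUT packet falls one level per cycle, and the function name `meetings` and its parameters are invented for the example.

```python
def meetings(in_start_time, out_start_level, length, horizon=64):
    """Return the set of (i, k) pairs such that packet i of an IN message and
    packet k of an OUT message occupy the same level at some cycle.

    in_start_time   -- cycle at which packet 0 of the IN message is at level 0
    out_start_level -- level of packet 0 of the OUT message at cycle 0
    length          -- number of packets in each message
    """
    met = set()
    for t in range(horizon):
        for i in range(length):
            in_level = t - (in_start_time + i)       # IN moves up one level per cycle
            if in_level < 0:
                continue                              # packet i has not entered yet
            for k in range(length):
                out_level = out_start_level + k - t   # OUT moves down one level per cycle
                if out_level == in_level:
                    met.add((i, k))
    return met
```

When the two start packets are an even distance apart, every pair (i, i) meets and no pair of opposite parity does; with an odd distance the start packets never meet, which is the situation the even-length stipulation rules out.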

At each cycle I_j, the content of cell j on the IN column, is compared with O_j, the content of
cell j on the OUT column. A match occurs if both cells contain the address packets of two
messages that can be combined. Note that this implies that the logic is able to recognize address
packets. The bits that have to be compared are those of the internal address, the opcode, and that
part of the path descriptor that identifies the MM. Thus, the number of bits that are compared
depends upon the stage of the network to which the switch belongs, and a suitable mask has to be
preprogrammed at each switch. If a match is found and the relevant location in the CHUTE is
empty, then the address packet is transferred to the CHUTE. In the implementation, the transfer
will be executed whenever the CHUTE is empty; the slot will be marked as full only if there was a
match. At the next cycle the message's data packet has to be transferred to the next location in
the CHUTE. Because both columns are moving, this transfer occurs at the same level where the
previous transfer occurred. A chute_transfer flag has to be set when a match occurs so that the
information will be passed to the next cycle, indicating that the data is also to be transferred. As
before we describe the movements of the queue in two phases; the first consists of the vertical
movements when the IN signals move up and the OUT (and CHUTE) signals move down. The
second phase specifies the control and lateral movements. The second phase is formally described
by the following transitions.

1.  I'_j := I_j;

2.  O'_j := if emptyO_j then I_j else O_j;

3.  C'_j := if emptyC_j then I_j else C_j;

4.  emptyI'_j := emptyI_j or (match(I_j, O_j) and emptyC_j) or chute_transfer_j or emptyO_j;

5.  emptyO'_j := emptyO_j and emptyI_j;

6.  emptyC'_j := emptyC_j and not match(I_j, O_j) and not chute_transfer_j;

7.  chute_transfer'_j := match(I_j, O_j) and emptyC_j.
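The seven phase-two transitions for two-packet messages can be written out for a single level j as a pure function. This is a sketch under stated assumptions: the `Level` record and the function name `phase2` are invented for the example, and the boolean `m` stands for match(I_j, O_j), which is assumed here to be true only when both cells hold valid address packets of combinable messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """State of one level j after the vertical (phase one) movement."""
    I: object = None              # packet on the IN column
    O: object = None              # packet on the OUT column
    C: object = None              # packet on the CHUTE column
    emptyI: bool = True
    emptyO: bool = True
    emptyC: bool = True
    chute_transfer: bool = False  # systolic control signal


def phase2(s: Level, m: bool) -> Level:
    """Apply transitions 1-7; every new value is computed from the old state."""
    return Level(
        I=s.I,                                                              # 1
        O=s.I if s.emptyO else s.O,                                         # 2
        C=s.I if s.emptyC else s.C,                                         # 3
        emptyI=s.emptyI or (m and s.emptyC) or s.chute_transfer or s.emptyO,  # 4
        emptyO=s.emptyO and s.emptyI,                                       # 5
        emptyC=s.emptyC and not m and not s.chute_transfer,                 # 6
        chute_transfer=m and s.emptyC,                                      # 7
    )
```

Note that transition 3 copies the IN packet into an empty CHUTE cell unconditionally, while transition 6 marks the cell full only on a match or an ongoing transfer, in line with the remark that the transfer is executed whenever the CHUTE is empty.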

The above describes the case where both the IN column and the OUT (also CHUTE) columns are
moving. There are times when the queue will fill up; this event will cause the switch at the
preceding stage to become blocked. In that event the switch which is unable to send (OUT and
CHUTE stationary) will still be able to receive (IN moving) until that queue fills up. Since the
elements on the IN column move at only half the normal speed relative to elements on the OUT
column, every other cycle is spent just moving signals up the IN column without any
computations. The cycles without computation are the odd cycles. Note however that in the blocked
state, if an address packet on the IN column matched at level j an address packet on the OUT
column, the transfer of this packet to the CHUTE will occur at level j as before, but the transfer
of the following data packet will occur two cycles later at level j+1. Thus the chute_transfer flag
has to be shifted up. Signals such as the chute_transfer flag which are message oriented signals
(versus packet signals) are called systolic control signals. Systolic control signals shift up one
position on odd cycles in the blocked state, and are otherwise static. Thus, in the blocked state data
on the IN column shifts up at each cycle, systolic signals shift up on the odd cycles, whereas the
transitions at the second phase of even cycles are the same as in the unblocked case. The
stipulation that a new message begins only at an even cycle enforces a correct synchronization of the
transitions of the systolic control signals with the transitions of the messages.
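The schedule of movements in the three global states can be tabulated in code, and the claim that the data packet is transferred two cycles later at level j+1 can be traced through it. The function names and the dictionary keys below are assumptions made for this sketch.

```python
def phase_actions(state, cycle):
    """Return which movements take place in a given cycle of a given global state."""
    odd = cycle % 2 == 1
    if state == "unblocked":
        return {"in_up": True, "out_down": True, "systolic_up": False, "lateral": True}
    if state == "blocked":
        # IN keeps moving; systolic signals shift on odd cycles; lateral
        # (phase two) transitions take place on even cycles only.
        return {"in_up": True, "out_down": False, "systolic_up": odd, "lateral": not odd}
    if state == "full":
        return {"in_up": False, "out_down": False, "systolic_up": False, "lateral": False}
    raise ValueError(f"unknown state: {state}")


def data_transfer_level(j):
    """Follow a chute_transfer flag set at level j on even cycle 0 of a blocked queue.

    Returns (flag_level, data_level) at the next cycle with lateral transitions.
    """
    flag_level, data_level = j, j - 1      # the data packet trails the address packet
    for cycle in (1, 2):
        acts = phase_actions("blocked", cycle)
        if acts["in_up"]:
            data_level += 1
        if acts["systolic_up"]:
            flag_level += 1
    return flag_level, data_level
```

After the odd cycle the flag sits at level j+1, and at the following even cycle the data packet arrives at that same level, so the lateral transfer finds the flag where it needs it.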

The final global state of the queue is full. When the queue is full all transitions are identity
transitions. This completes the description of the queue under the assumption of two-packet
messages.

The phase one movements of the signals are summarized below for each of the queue's various
states. The queue contains three types of signals: packet formatting signals (for example
data_packet, address_packet), cell control signals such as emptyO and emptyC, and finally systolic
control signals such as chute_transfer. All three types of signals are unidirectional; signals on the
IN column move up, signals on the OUT and CHUTE columns move down, with the exception of systolic
control signals, which are sometimes stationary. Both of the packet signals are always associated
with a given packet in a message. The systolic control signals move through the message to visit
the 1st, and then the 2nd packet in the message.

The packet and cell oriented signals always move vertically when their column is moving. In
the unblocked state, all packet signals move vertically during phase 1. In the blocked state, only
IN packet signals move vertically. In the unblocked state, systolic signals stay in place at a given
physical slot while successive packets of both OUT and IN messages reach the physical slot on
successive cycles. In the blocked state, systolic signals must move only on odd cycles to accomplish
the same goal. Since the blocked state differs from the unblocked state only in respect to a)
movement of systolics on odd cycles and b) vertical movements of packet signals on odd cycles,
we will not discuss the case of blocked or full queues when we describe the more general combine
queue.

This scheme may be adapted to messages with other formats, provided that the number of
packets in the address part of the message is fixed, and the first and last packets of each message
are tagged. The start packet of each message on the IN column meets the start packet of each
message on the OUT column. When a message_start packet on the IN column meets a message_start
packet on the OUT column a sequence of comparisons for match is initiated. The results of the
comparisons on all of the address packets of the message are ANDed to get the result for the
message. Since these comparisons take place on successive cycles (or every even cycle in the blocked
case), the result is stored in a systolic control signal called match_flag. The comparison phase
ends when the last two address packets are compared. Note that in the case of a queue with both
IN and OUT columns moving, all these successive comparisons are made at the same physical
level in the structure. When a successful match has occurred, a transfer to the CHUTE column is
initiated, and continues until the end of the message is detected. The chute_transfer systolic signal
indicates such an ongoing transfer. Transfers to the OUT column are performed in a similar manner,
and an out_transfer systolic signal controls them.

We present below the transition equations of the second phase of a normal cycle, for
messages of even (varying) length, each containing two address packets.


1.  I'_j := I_j;

2.  O'_j := if out_transfer_j or (message_startI_j and emptyO_j and message_endO_{j-1})
            then I_j
            else O_j;

3.  C'_j := if chute_transfer_j or (message_startI_j and message_endO_{j-1} and emptyC_{j+1})
            then I_j
            else C_j;

4.  emptyI'_j := emptyI_j or (emptyO_{j-1} and message_endO_{j-2})
                 or (match(I_j, O_j) and match_flag_j and emptyC_j);

5.  emptyO'_j := emptyO_j and (emptyI_{j+1} or not message_startO_j);

6.  emptyC'_j := emptyC_j
                 and not (match(I_j, O_j) and match_flag_j and not emptyI_j);

7.  match_flag'_j := message_endI_{j-1} and message_startO_j
                     and not emptyO_j and match(I_j, O_j);

8.  out_transfer'_j := (message_startI_j and emptyO_j and message_endO_{j-1})
                       or (out_transfer_j and not message_endI_j);

9.  chute_transfer'_j := (match(I_j, O_j) and match_flag_j and emptyC_j and not emptyI_j)
                         or (chute_transfer_j and not message_endI_j);

Exactly two packets per address
The boundary conditions are obtained by postulating that a match always fails in the first
stage, and that the OUT and CHUTE columns contain empty cells in the stage after the last one.
If these boundary cells also have the message_end bit on, then it is not necessary to initialize the
structure: all the internal registers will be initialized correctly by running through a sufficiently
large number of normal cycles with null inputs. Note that a message on the CHUTE column is
two stages off from the associated message on the OUT column, so that the CHUTE column
should be shorter by two stages (we postulated that matches fail at the first stage, so that transfers
to the missing stages on the CHUTE column are not attempted). Note that an incoming message
will not combine with a message which starts leaving the queue at the same cycle.

We mentioned that the set of address bits that have to be compared varies from stage to
stage. If there are several address packets, then the set of bits to be compared in the successive
packets of the message also varies. We require that the stage-dependent set of address bits to be
compared occur in the first packet, while all the bits of the remaining address packets are used in
the comparison. Thus, the stage-dependent comparison mask is used for packets with the
message_start signal on.
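The masked comparison can be illustrated with integers standing in for packets. This is only a sketch: the function name `packets_match`, the 16-bit packet width, and the mask values in the usage below are assumptions, not figures taken from the design.

```python
def packets_match(a: int, b: int, is_start: bool, stage_mask: int, width: int = 16) -> bool:
    """Compare two address packets held as unsigned integers.

    For a packet with message_start on, only the bits selected by the
    preprogrammed, stage-dependent mask are compared; for the remaining
    address packets every bit takes part in the comparison.
    """
    full_mask = (1 << width) - 1
    mask = stage_mask if is_start else full_mask
    return ((a ^ b) & mask) == 0
```

For example, with a mask of 0b11110000 two start packets that differ only in their low four bits compare equal, whereas the same two values treated as later address packets do not.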

The structure described by the previous transition equations carries redundant information,
and that can be used to save hardware. For example the message_start signal is redundant, since
it is on only at the first nonempty cell following a cell that is empty or has the message_end bit
on.
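The redundancy of message_start can be made concrete by deriving it from the other two bits. In this sketch the cells of a column are listed in their direction of travel, and the cell beyond the boundary is assumed to be empty with message_end on; the function name and the tuple layout are invented for the example.

```python
def derive_message_start(cells):
    """Recompute message_start for each cell from the empty and message_end bits.

    cells -- list of (empty, message_end) pairs, ordered so that cells[0]
             is the cell that leaves the column first.
    """
    starts = []
    prev_empty, prev_end = True, True      # assumed boundary cell
    for empty, end in cells:
        # A start packet is a nonempty cell whose predecessor is empty
        # or carries the message_end bit.
        starts.append((not empty) and (prev_empty or prev_end))
        prev_empty, prev_end = empty, end
    return starts
```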

A similar scheme applies for any fixed number of address packets. However, if the number
of packets in the address part of the message is k, then the occurrence of a match, and hence the
validity of a transfer to the CHUTE, can be certified only k cycles after the comparison and the
transfer begin. Thus, the control logic at the j-th level of the structure must be connected to level
number j+k on the IN column and level j-k on the CHUTE column, in order to set the
empty bits correctly. Also, transfers to the CHUTE column cannot be initiated at the lowest levels
of the queue, so that merging occurs only when the queue fills up.

An alternative scheme may be used to obviate this problem. Empty flags will be set on the
OUT and CHUTE columns only on those cells where transfers may be initiated.

Note that if a message can be combined with a previous one, only the first of its address
packets is distinct and has to be transferred to the CHUTE. This prevents destruction of valid
data in the CHUTE since the address packet is written only when it is near the flag specifying
whether the CHUTE position is empty. We also mark the first packet of each message, the last
packet of each message, and the last address packet of each message with identifying
message_start, message_end and address_end flags respectively. These control bits will serve to
sequence through the various operations to be performed on messages in the queue, without any
wired-in specification of address and message size. The length of each message, and the length of
intervals between message arrivals, are assumed to be even. Finally, assume for simplicity that the
address part of a message contains a fixed, larger than two, even number of packets. A similar
scheme can be devised for odd-length address parts.

The flavor of such a scheme yields a sequence of

initiate - abandon or complete

in which operations which are initiated can be abandoned at some later time (when more
information becomes available) as long as such abandonment does not affect the queue in a non-local
manner.

We describe the functioning of the Combine queue in the normal state, where packets in the
IN column move up one position each cycle, and packets in the OUT and CHUTE columns shift
down one position each cycle.

The start packet of the message is transferred to the CHUTE when it passes by the last
packet of the address part of a message on the OUT column, provided the corresponding location
on the CHUTE is empty. It moves down on the CHUTE column at subsequent cycles. Thus,
when the comparison of the address packets of a message on the IN column with a message on the
OUT column is completed, the start packet of the message on the IN column is at the level where
the comparisons were made. If they were successful, the cell on the CHUTE column containing
the start packet will be marked as full, inhibiting further transfers. Also, in case of a successful
match, two more actions occur. First, the last address packet of the message in the IN column is
marked empty. Second, a chute_transfer flag is set on at the level where the comparisons were
done. Successive data packets arriving at this level on the IN column will be transferred to the
CHUTE. The chute_transfer flag is disabled after the last packet of the message has been
transferred. Subsequent attempts to transfer a message to the CHUTE will always start at the slot
that contains the address_end packet, and which is marked full. Thus, it is not necessary to mark
the other packets as full.

The empty bit is correctly set in the IN column only at the last address packet of a message.
This requires a change to the logic of the transfers from the IN column to the OUT column. Slots
on the OUT column normally have the message_end and empty bits on. A transfer from the IN
column to the OUT column is initiated when a message start packet on the IN column meets an
empty slot on the OUT column following a slot with message_end on. When a transfer is initiated an
out_transfer flag is set on, and successive packets arriving on the IN column at the level where
the transfer was initiated will transfer to slots on the OUT column. The empty bit in the message
start packet that was transferred to the OUT column is set off when the packet is passing across
the address end packet of the IN message which contains a full bit. Thus, further transfers of data
to this slot will be inhibited. If, on the other hand, the address end packet had its empty bit on,
the transfer is not validated, and may be later overwritten by a new transfer. Note that a
transfer to one of the slots on the OUT column containing invalid data packets will always be
initiated at the first slot which has a correct value for its empty bit, so that it is not necessary to
mark the other slots as full.

This solution is not yet correct: The first packet of a message shifting down on the OUT
column may not meet the last address packet of this message if the transfer to the OUT column
occurred early. The empty bit on the IN column is not set on, and the empty bit on the OUT
column is not set off.

The problem can be solved by prohibiting lateral transfers of data in the first few stages.
Such a solution entails unnecessary delays. These initial stages, where no lateral movements occur
and no comparisons are done, need not exist physically. Their effect can be rendered by finite
state logic that monitors incoming and outgoing packets. The state variable bstate of this
"boundary automaton" assumes three values: default; in, which holds when the address packets of a
message are incoming; and inout, which holds when address packets of the same message are both
entering the IN column and leaving the OUT column.

An incoming message is detected by the arrival of a packet with the address_start flag on;
end of address transmission is detected by the arrival of a packet with the address_end flag on.
While a message is incoming the IN column there must be a message outgoing the OUT column.
Thus, if a first message packet marked empty reaches the first stage of the OUT column, this is
the first packet of the incoming message. A transition to the inout state occurs, and that packet is
marked full. If a last address packet of a message enters the IN column while address packets of
the same message are leaving the OUT column, then this packet is marked empty.

The  transitions  corresponding  to  this  lengthy  informal  description  are  given  below. 

In the first phase of a normal transition packets on the IN column shift up, whereas packets
on the OUT and CHUTE columns shift down. An incoming packet is marked empty, unless it is
a last address packet and bstate ≠ inout.

The transitions on the second phase are given below.

1.  I'_j := I_j;

2.  O'_j := if (message_endO_{j-1} and emptyO_j and message_startI_j)
               or out_transfer_j
            then I_j   {transfer from IN}
            else O_j;

3.  C'_j := if emptyC_{j+1} and message_startI_j and address_endO_{j-1}
            then I_j   {transfer of start packet from IN}
            elseif chute_transfer_j then I_j   {transfer of data packets from IN}
            else C_j;

4.  emptyI'_j := emptyI_j
                 or (match_flag_j and match(I_j, O_j) and emptyC_j)
                 or (emptyO_{j-1} and message_endO_{j-2});

5.  emptyO'_j := emptyO_j and
                 (emptyI_{j+1} or not message_startO_j);
    emptyO'_0 := emptyO_0 and not
                 ((message_startO_0 and bstate = in) or (message_startI_0 and message_endO_{-1}));

6.  emptyC'_j := emptyC_j
                 and not (match_flag_j and match(I_j, O_j)
                          and address_endI_j and not emptyI_j);

7.  match_flag'_j :=
        ((message_startI_j and message_endO_{j-1} and not emptyO_j) or match_flag_j)
        and match(I_j, O_j) and not address_endI_j;

8.  out_transfer'_j := (out_transfer_j or
                        (message_endO_{j-1} and emptyO_j and message_startI_j))
                       and not message_endI_j;

9.  chute_transfer'_j := (chute_transfer_j or
                          (match_flag_j and match(I_j, O_j)
                           and address_endI_j and emptyC_j and not emptyI_j))
                         and not message_endI_j;

For any even number of address packets

The boundary conditions are obtained by postulating that at the end of the OUT column
there is a cell which is empty and has the message_end signal on, that there is an empty cell at the
end of the CHUTE column, and that matches fail at stage -1. Note that the values of the signals
message_endO_{-1} and empty_chute_{-1} are used to set signals at the first stage (Eq. 2 and Eq. 9).
These are obtained by storing the message_end flag of the last packet that left the OUT column
and the empty flag of the last packet to leave the CHUTE column.
The transitions of the "boundary automaton" are defined below.


old state   new state   condition

def         def         not message_startI_0
def         in          message_startI_0 and not (message_endO_{-1} and emptyO_0)
def         inout       message_startI_0 and message_endO_{-1} and emptyO_0

in          def         address_endI_0
in          in          not address_endI_0 and not (message_startO_0 and emptyO_0)
in          inout       message_startO_0 and emptyO_0

inout       def         address_endI_0
inout       inout       not address_endI_0
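One possible reading of the boundary-automaton transitions tabulated above is sketched below as a next-state function. Several points are assumptions rather than facts from the text: the argument names, the interpretation of the inout row as having only the two transitions to def and inout, and the choice to let address_end take precedence when two conditions of the in row hold at once.

```python
def next_bstate(bstate, message_startI0, address_endI0,
                message_startO0, emptyO0, message_endO_m1):
    """Next state of the boundary automaton (states 'def', 'in', 'inout').

    message_endO_m1 stands for message_endO at stage -1, i.e. the stored
    message_end flag of the last packet that left the OUT column.
    """
    if bstate == "def":
        if not message_startI0:
            return "def"
        # A message starts entering; it is also leaving if OUT is free at stage 0.
        return "inout" if (message_endO_m1 and emptyO0) else "in"
    if bstate == "in":
        if address_endI0:
            return "def"                   # assumed precedence of address_end
        return "inout" if (message_startO0 and emptyO0) else "in"
    if bstate == "inout":
        return "def" if address_endI0 else "inout"
    raise ValueError(f"unknown state: {bstate}")
```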

When the queue is blocked, packets on the IN column move up at the first phase
of each cycle, systolic signals move up at the first phase of each odd cycle, and a normal
transition occurs at the second phase of each even cycle. At the second phase of an odd cycle no
transition occurs, with the exception of the transition of the boundary automaton.

As in the previous structure, a message on the CHUTE column is two stages off from
the matching message on the OUT column. Thus, the last two stages on the CHUTE column
are missing, and transfers to these two stages are inhibited.

We  mentioned  before  the  possibility  of  merging  together  two  combine  queues  that  feed 
the  same  output  port.  The  merged  queue  has  two  IN  columns,  one  CHUTE  column  and  one 
OUT  column,  and  an  arbitration  mechanism  is  used  at  each  stage  to  resolve  transfer  con- 
flicts. 

Such a structure is not compatible with the latest version of a combine queue we
presented, where transfers can be initiated and later abandoned. If this philosophy is used,
then a structure with two OUT columns and two CHUTE columns is needed. The arbitration
logic will guarantee that only one of the two CHUTE positions is marked full at any
level, and that only one of the two OUT positions is marked full.

The CHUTE and OUT column outputs are fed into the combine logic where messages
are actually combined. The value of the outgoing packet of the combined message is either
the value of the packet coming from the CHUTE column (address packets and data packets
on a STORE), or the sum of the two packets (data packets on a FETCH and ADD).
Additions are pipelined over the succeeding data packets of a message.
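The combine rule can be sketched with integers standing in for packets. The function below is an illustration under assumptions that the text does not state: data packets are taken to arrive least significant first, so that the carry of one packet feeds the next; the packet width is set to 16 bits; and both messages are given as complete, equally long packet lists. The names `combine`, `n_addr` and the opcode strings are hypothetical.

```python
def combine(op, out_packets, chute_packets, n_addr, width=16):
    """Combine the OUT and CHUTE messages packet by packet.

    op            -- "STORE" or "FETCH_AND_ADD"
    out_packets   -- packets of the old message, from the OUT column
    chute_packets -- packets of the new message, from the CHUTE column
    n_addr        -- number of leading address packets
    """
    mask = (1 << width) - 1
    result, carry = [], 0
    for idx, (o, c) in enumerate(zip(out_packets, chute_packets)):
        if idx < n_addr or op == "STORE":
            result.append(c)                 # pass the CHUTE packet through
        else:
            total = o + c + carry            # pipelined addition, one packet per cycle
            result.append(total & mask)
            carry = total >> width           # carry into the next data packet
    return result
```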

4.4.  Wait_Buffer

The Wait_Buffer has two inputs and one output. Messages for storage are received from
the Combine_Queue. Messages for matching come from the FromMM input port.
These messages are matched against all the messages in the Wait_Buffer. If a successful
match occurs the matched message is released to the ALU where it is combined with the
corresponding reply. The design goals for this structure are again the same overall design
goals: the structure is pipelined, and is able to accept a packet on each input line with no
interference between the two inputs, as long as it is not full. The probability of a full
Wait_Buffer can be reduced to any desired level by increasing the structure size. The use of
a one dimensional, systolic-type structure to support these functions is possible, but seems to
imply increased complexity and decreased performance. We resort therefore to a more
conventional "two dimensional" structure with global data busses.

The Wait_Buffer consists of a sequence of slots each capable of storing a full length
message. These slots are connected to three global busses: the Input Bus for incoming
messages from the Combine_Queue, the Search Bus for incoming messages from the FromMM
port, and the Output Bus for messages leaving the Wait_Buffer. These three busses are
packet wide, and message transmission on these busses is pipelined. Transmissions may
occur concurrently on all three of them. The data on the Search Bus is read simultaneously
by all the slots for comparison. Only one slot at a time may read from the Input Bus, and
only one slot at a time may write on the Output Bus.

Each slot is built around a packet wide internal bus. To this internal bus are connected
three circular buffers: the address ring, which stores the packets containing the address of the
message; the data ring, which stores the packets containing the data of the message; and the
extra address ring, which stores the packet(s) containing the address part of the message
forwarded to the MM which differs from the address part of the message stored in the
Wait_Buffer. In addition, the internal bus is connected to a comparator and to the Input Bus and
the Output Bus. The second input to the comparator comes from the Search Bus.

At any given time the internal bus is connected to a unique input (the Input Bus or one
of the rings) and to a unique output (the comparator, the Output Bus, or one of the rings).
The slot can be in one of four states: EMPTY, FILLING, FULL, or EMPTYING. Empty
slots compete for priority at cycles where message transmission ends on the Input Bus or
where no transmission occurs. A daisy chain scheme is used to arbitrate between the
competing slots. A slot that achieves priority enters the FILLING state and connects to the
Input Bus. It reverts to the EMPTY state if a message transmission is not started
immediately, or if a message transmission is cancelled.

When the slot is in the FILLING state, data is read from the Input Bus into the
storage rings. At message end the slot transits to the FULL state. In the FULL state the
comparator is fed from the address rings. If a successful match occurs the slot transits to the
EMPTYING state, where the Output Bus is fed from the storage rings; at end of message it
reverts to the EMPTY state. The state transitions (and the connections of the different rings
to the internal bus) are driven by the match signal from the comparator, the priority signal
from the arbitration mechanism, and the formatting signals on the data lines. As for the
Combine_Queue, we assume that flags mark the first packet of a message, the last address
packet of a message, and the last packet of a message. When a slot is in the FILLING state its
transitions are driven by the formatting signals on the Input Bus; when it is in the FULL state
they are driven by the formatting signals on the Search Bus; finally, when it is in the EMPTYING
state they are driven by the formatting signals on the Output Bus.
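
The slot life cycle described above can be sketched as a small state machine. This is only an illustrative Python model, not the chip logic: the class name Slot, the signal names (priority, first, last, cancel, match) and the internal receiving flag are assumptions introduced here, and each call to step() stands for one clock cycle.

```python
from enum import Enum, auto


class State(Enum):
    EMPTY = auto()
    FILLING = auto()
    FULL = auto()
    EMPTYING = auto()


class Slot:
    """Toy model of one Wait_Buffer slot (signal names are hypothetical)."""

    def __init__(self):
        self.state = State.EMPTY
        self.receiving = False  # True once the first packet of a message was seen

    def step(self, *, priority=False, first=False, last=False,
             cancel=False, match=False):
        """Advance one cycle and return the new state."""
        if self.state is State.EMPTY:
            # A slot that wins the daisy-chain arbitration attaches to the Input Bus.
            if priority:
                self.state, self.receiving = State.FILLING, False
        elif self.state is State.FILLING:
            if cancel or (not self.receiving and not first):
                # No message started immediately, or the transfer was cancelled.
                self.state = State.EMPTY
            else:
                self.receiving = True
                if last:  # last-packet flag on the Input Bus
                    self.state = State.FULL
        elif self.state is State.FULL:
            if match:  # comparator matched the address on the Search Bus
                self.state = State.EMPTYING
        elif self.state is State.EMPTYING:
            if last:  # last-packet flag on the Output Bus
                self.state = State.EMPTY
        return self.state
```

In this sketch the flags passed while FILLING stand for the formatting signals on the Input Bus, and the last flag passed while EMPTYING stands for the one on the Output Bus.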

A matched message leaves the Wait_Buffer with two cycles delay with respect to the
matching incoming message. Thus, each incoming packet has to be stored for two cycles
before it is fed into the ALU where the new message is computed.

4.5.   Alternative Combining Scheme

The Wait_Buffer implementation described above requires that an associative search be
done on each returning message. This may slow down the switch if Wait_Buffers are large.
An alternative implementation avoids this problem, at the expense of having an extra field in
each message. This field is used to indicate whether this is a combined message and, if so, at
which stage the combination occurred and where in the Wait_Buffer at that stage the
relevant information is stored. Each message carries a stage number and a Wait_Buffer
address, with a special, illegal stage number indicating a non-combined message. A message
thus contains five fields:

1.  PD, the path descriptor (amalgamated PE/MM number);

2.  CF, the combine field;

3.  IADD, the internal MM address;

4.  OPC, the operation opcode;

5.  DATA, the data item.

The combine field consists of two subfields:

1.  CFS, the stage subfield, and

2.  CFA, the address subfield.

When a request of the form <p m CFS CFA IADD OPC DATA> matches in the
combine queue a previous request of the form <p' m CFS' CFA' IADD OPC DATA'>
then the old message is stored in the wait buffer, together with p' and <CFS CFA>,
the combine field of the new message (the data of the old message is stored only if the
operation is a Fetch&Add). The new message is forwarded with a modified combine
field <CFS" CFA">, where CFS" is the stage number of the switch, and CFA" is the
address where the old message was stored in the wait buffer.

When a reply of the form <p m CFS" CFA" IADD OPC DATA> is received at a
FromMM port, its field CFA" is used as an address to access the Wait_Buffer, and
simultaneously its field CFS" is matched against the stage number of this switch. If
they match then the entry at address CFA" is deleted from the Wait_Buffer, the combine field
<CFS" CFA"> of the message is replaced by the combine field <CFS CFA>
extracted from the Wait_Buffer, and a new message is created, which is of the form
<p' m CFS' CFA' IADD OPC DATA">, where DATA" is empty for a store,
DATA" = DATA for a load, and DATA" = DATA + DATA' for a Fetch&Add.

The management of empty slots in the Wait_Buffer is done by keeping a separate
queue of addresses of empty slots, the Avail queue. Whenever a message is combined
at the Combine_Queue the top entry of the Avail queue is deleted, and is inserted as
the new CFA" field of the combined message. Whenever a returning message has a
CFS field that matches the current stage number then its CFA field is added to the end
of the Avail queue. The management of the empty slots address queue can be done
concurrently with the accesses to the Wait_Buffer.
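
The combine field bookkeeping and the Avail queue can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: messages are plain dictionaries with hypothetical keys "p", "m" and "cf" (a pair <CFS, CFA>), the value 0b111 is taken as the illegal stage number, and the data path (the ALU that computes DATA") is deliberately left out.

```python
from collections import deque

NOT_COMBINED = 0b111  # assumed encoding of the illegal stage number


class WaitBuffer:
    """Directly addressed Wait_Buffer with an Avail queue of free slot addresses."""

    def __init__(self, stage, size=16):
        self.stage = stage
        self.slots = [None] * size
        self.avail = deque(range(size))  # addresses of empty slots

    def combine(self, new_msg, old_msg):
        """Store old_msg and return new_msg re-tagged with <stage, slot address>."""
        if not self.avail:
            raise RuntimeError("Wait_Buffer full")
        addr = self.avail.popleft()  # top entry of the Avail queue
        # Keep the old message together with the combine field of the new one.
        self.slots[addr] = (old_msg, new_msg["cf"])
        return {**new_msg, "cf": (self.stage, addr)}

    def reply(self, msg):
        """Return the replies that leave this switch for one returning message."""
        cfs, cfa = msg["cf"]
        if cfs != self.stage:
            return [msg]  # not combined at this stage: pass through unchanged
        old_msg, saved_cf = self.slots[cfa]
        self.slots[cfa] = None
        self.avail.append(cfa)  # freed address goes to the end of the Avail queue
        restored = {**msg, "cf": saved_cf}  # reply for the forwarded request
        spawned = {**msg, "p": old_msg["p"], "cf": old_msg["cf"]}  # reply for p'
        return [restored, spawned]
```

Because the reply carries the slot address directly, no associative search is needed; the only comparison is between CFS" and the stage number of the switch.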

If the Combine_Queue and the Wait_Buffer are on distinct chips then the Avail
queue is kept on the Wait_Buffer chip. On the Combine_Queue chip a register stores
the value of the first item of the Avail queue. When two messages are combined the
new CFA" field of the combined message is obtained from this register. When the old
message is transferred from the Combine_Queue to the Wait_Buffer, the first
entry is deleted from the Avail queue, and a new address, or a Buffer full flag, is
returned.

5.   Partitioning 

As stated previously, the chips implementing the Ultracomputer Network are pin
intensive since each switch is connected to eight unidirectional data paths, each message
wide. We have already shown how to decrease the pin count by pipelining messages.
Using this technique and a proposed packet partitioning of 4 packets per message, the
number of pins is reduced by roughly a factor of four. Using still more
packets per message would decrease the pin count even more; however, such increases
have a throughput cost of increased network congestion as well as a latency cost
of an additive factor in terms of network transit time.

It is possible to have a functional partition of the switch. A two chip partition of
the switch is obtained by separating the PE-to-MM path from the MM-to-PE path.
Two data lines connect the two Combine_Queues on the PE-to-MM chip to the two
Wait_Buffers on the MM-to-PE chip. Thus each chip is connected to six data lines,
and the number of pins is reduced by roughly a factor of 4/3. A five chip
partition of the switch is obtained by having each combine queue and each wait buffer
on a separate chip. Each chip now has two incoming and two outgoing data lines, so
that the number of pins is reduced by roughly a factor of 2 as compared to a one chip
version. Note however that the ToMM and FromMM ports are connected to two
chips. A simple external arbiter is required to arbitrate between the two chips feeding
the same FromMM port. Note that three types of chips are needed for a switch in that
configuration. The pin count can be further reduced (and the number of chips per
switch further increased) by using redundant hardware. Each of the chips of the five
chip configuration (arbiter not included) has two incoming and two outgoing data
lines. We can duplicate these chips. The two copies of a chip will each be connected
to the two input lines and to one of the two output lines. We obtain a nine chip
configuration with a pin count reduced by roughly a factor of 8/3 from the original
configuration*.

The family of partitionings for the USW all use the same structures, and are
therefore design independent. Thus, the decision of how to package these structures can be
postponed until the chips are fully designed. This is especially advantageous because
chip packaging technology is undergoing rapid change, and new packages with much
larger pin counts will become available.

* Not all the hardware has to be replicated: the FIFO buffer associated with the output port not used can be
deleted. Also, each copy needs to process only messages whose destination agrees with the number of the
output port to which this copy is connected (note however that all returning messages have to be sent to the wait
buffer).

Let m be the size of a packet (currently m = 16), p be the number of protocol
lines used with each data line (currently p = 2), and c be the number of extra
connections needed per chip for power, clock and setting signals (currently c = 7). A one chip
configuration requires 8(m+p)+c pins per chip; a two chip configuration requires
6(m+p)+c pins per chip; a five chip configuration requires 4(m+p)+c pins per chip;
finally, a nine chip configuration will require 3(m+p)+c pins per chip. For the current
values this means 151 pins for a one chip switch, 115 pins for a two chip switch, 79
pins for a five chip switch, and 61 pins for a nine chip switch. A one chip switch with
no pipelining would require 535 pins. Longer words (64 rather than 32) and a larger
address space would push this number to more than 800. Currently, the fabrication
service we are using will package up to 63 pins per chip, so that a nine chip
configuration is mandatory. Packaging techniques of several hundreds of pins per chip are being
forecast, and we will be able to take advantage of even 500 pins or more without
redesign. Above that level, 2x2 switches would probably be replaced with 4x4
switches.
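
The pin counts quoted above can be checked with a few lines of Python. The function name pins_per_chip and the DATA_LINES table are illustrative names introduced here. The 535 pin figure is reproduced under the reading that an unpipelined data line is a full message wide, that is 4 packets of 16 bits.

```python
def pins_per_chip(data_lines, m=16, p=2, c=7):
    """Pins needed by a chip with `data_lines` packet-wide data lines.

    m: packet size in bits, p: protocol lines per data line,
    c: extra pins for power, clock and setting signals.
    """
    return data_lines * (m + p) + c


# Number of data lines attached to each chip, keyed by chips per switch.
DATA_LINES = {1: 8, 2: 6, 5: 4, 9: 3}

counts = {chips: pins_per_chip(lines) for chips, lines in DATA_LINES.items()}

# Assumed reading: without pipelining each data line carries a whole
# 4-packet message at once, i.e. 4 * 16 = 64 bits.
unpipelined = pins_per_chip(8, m=4 * 16)

for chips, pins in counts.items():
    print(f"{chips} chip configuration: {pins} pins per chip")
print(f"one chip, no pipelining: {unpipelined} pins")
```

Only the nine chip value falls under the 63 pin packaging limit mentioned in the text.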

5.1.   Merge Chip - Wait Chip Communication

If the MMtoPE path and the PEtoMM path are on separate chips, then the
PEtoMM chip must send to the MMtoPE chip the message that was combined as well
as the extra address field. If there are N packets in a message to be forwarded, the
Wait_Buffer port will receive N+1 packets. The initial packets received by the wait
buffer contain the old message. The last ((N+1)st) packet contains the first address
packet of the new message. This will occasionally force the combine queue to block
for two cycles. If more pins are available then the transmission of the path descriptor
of the old message is done in parallel with the transmission of the new message, thus
avoiding this extra delay.

If the last, more general scheme is used for the Combine_Queue, then at the time
the transmission to the wait buffer is begun, the pipelined chute matching may not
have completed. It is therefore necessary to reconfirm the transfer to the wait buffer
after the matching has been completed so that the wait buffer knows whether to
abandon the acceptance of a merged message. This is done by setting the ValidData line
high during the address end cycle of the transfer.

7.   Appendices - Protocols

7.1.  Timing and Handshaking

Each  chip  in  the  node  contains  a  number  of  ports,  each  port  capable  of  receiving 
or  sending  part  of  a  message.  The  portion  of  the  message  each  port  can  handle  is 
called  a  packet.  The  chips  in  the  node  (with  the  exception  of  the  arbiter)  receive, 
switch,  store  and  forward  these  packets  along  a  pre-determined  path. 

Each port on the chip consists of a data section and a control section. Each port
is unidirectional (either sender or receiver) and is connected to a unique external origin
or destination. The two control lines associated with each port are the ValidData line,
which is set by the sender, and the DataAccept line, which is set by the receiver.

There  are  two  constraints  on  message  transmissions. 

Once  transmission  has  begun  on  a  message,  a  packet  is  received  every  cycle  until 
the  last  packet  of  the  message  is  received. 

A further constraint on the start of the message sending sequence is due to the
compact parallel queues we have chosen to implement. These queues require
that a new transmission can be started only if the parity of the current cycle count
is correct. For example, level 1 will accept the start of a message only during
odd numbered cycles; level 2 will accept the start of a message only during even
numbered cycles. Since passage through a stage in the network requires at least
one cycle, this added restriction is only meaningful at the endpoints of the
network.

A transmission of a message starts at a cycle of correct parity where both DataAccept
and ValidData are set. Thus:

A sender must assert the ValidData signal, and assert the first message packet on the
data lines, at any cycle of correct parity where it is ready to start transmission of a
message.

A receiver must assert the DataAccept signal at any cycle of correct parity where it is
ready to start reception of a message (and sure to be able to receive an entire
message).

During message transmission, and at cycles of wrong parity, the DataAccept and ValidData
lines are ignored, effectively allowing these lines to be set up many cycles before the start of
a transmission.
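
The start condition of the handshake can be sketched as follows. This is a minimal Python model, not the circuit: the function names, the level_parity argument (1 for a level that accepts on odd cycles, 0 for even) and the fixed message length of 4 packets are assumptions made for illustration.

```python
def starts_transmission(cycle, valid_data, data_accept, level_parity,
                        in_message=False):
    """True if a new message transfer begins on this cycle."""
    if in_message:
        return False  # control lines are ignored while a message is in flight
    if cycle % 2 != level_parity:
        return False  # wrong parity: control lines are ignored
    return valid_data and data_accept


def trace(valid, accept, level_parity, packets=4):
    """Return the cycles at which transfers start, given per-cycle line values."""
    starts, busy_until = [], -1
    for cycle, (v, a) in enumerate(zip(valid, accept)):
        busy = cycle <= busy_until
        if starts_transmission(cycle, v, a, level_parity, in_message=busy):
            starts.append(cycle)
            # One packet is transferred every cycle until the last packet.
            busy_until = cycle + packets - 1
    return starts
```

For example, with a sender that holds ValidData high from cycle 0 and a receiver that raises DataAccept from cycle 2, trace([True] * 10, [False, False] + [True] * 8, 1) reports starts at cycles 3 and 7: the lines were set up early, but the transfer only begins at the first odd cycle where both are set.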

7.2.   Data Format

The network nodes have been designed to be as flexible as possible, with as little of the
exact format of the data embedded in the circuitry as possible. Thus the design will allow the
postponement of packaging decisions by rendering the circuit insensitive to packaging parameters.
This is particularly important as the most critical constraint in the design of the nodes is pin
limitation in the standard packaging technologies. Standard technologies all provide less than 100 pins,
while the newer technologies are promising several hundred pins. In view of this constraint,
several partitions of the chip were designed, each for a different number of pins and each
providing the same performance, albeit with different pin counts. Each partition is design insensitive,
although they require modest changes to the chip. A change in packet size necessitates the
widening (or narrowing) of data paths on chip. If the number of packets per message changes, the
modifications to the chip are limited to the front end.

The  following  is  a  list  of  parameters  associated  with  the  data  format. 

s(op)  Size  of  opcode 

s(baddr)  Size  of  bank  address 

s(MMaddr)  Size  of  memory  module  address 

s(stnum)  Size  of  stage  number 

s(WBaddr)  Size  of  Wait_Buffer  address

s(data)  Size  of  data 

PacketSize  Number  of  bits  per  packet. 

There are a number of assumptions made about the format of data to the chip and the
timing of data arrival. In each case the assumptions either impose no overhead on the system, are
optimal, or would significantly simplify the design.

1.       The message is broken up into at least two packets, with address followed by data.

2.       Address and data do not share packets.

3a.     The format of address packets at the PNI is:

          low order packet:     | Opcode | high order PacketSize-s(MMaddr)-s(op) bits of address within MM | MM#' |

          remaining packets:    | address within MM |

3b.     The format of the address packets at stage i in the network is:

          low order packet:     | Op Code | rightmost i bits of PE#' | high order PacketSize-s(MMaddr)-s(op)
                                  bits of address within MM | leftmost (s(MMaddr) - i) bits of MM# |

          The other packets are identical to those at the first level**.

          This implies that s(MMaddr) + s(op) ≤ PacketSize.

4.       The format of the data packet is:

          | Data |

5.       Packets within Data are always low order packets first; packets within Address are high
order packets first. This is necessary to route and to perform pipelined addition.
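
The way the path descriptor changes from stage to stage can be illustrated with a small Python sketch. It rests on assumptions that go beyond the text: the bit recorded in the PE# field at each stage is taken to be the switch input port on which the message arrived, and new PE# bits are simply appended to a list, so the exact bit positions inside the packet are not modelled. The names reversed_bits and route are hypothetical.

```python
def reversed_bits(n, width):
    """Bits of n in reverse order: most significant bit rightmost (last)."""
    return [(n >> k) & 1 for k in range(width)]


def route(mm, in_ports, width=6):
    """Follow a request through len(in_ports) stages.

    mm: destination memory module number.
    in_ports: assumed input port (0 or 1) used at each successive stage.
    Returns (pe_field, mm_field, out_ports) after the last stage visited.
    """
    mm_field = reversed_bits(mm, width)  # the MM#' field of the first packet
    pe_field = []                        # PE#' bits accumulated so far
    out_ports = []
    for in_port in in_ports:
        out_ports.append(mm_field.pop())  # route on the rightmost MM#' bit
        pe_field.append(in_port)          # assumed: record the arrival port
    return pe_field, mm_field, out_ports
```

After i stages the sketch holds i bits of PE# and s(MMaddr) - i bits of MM#, so the two fields together always occupy s(MMaddr) bits, which is why the address packet keeps a constant width along the path.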

If the alternative combining scheme that requires a combine field in each message is
used, then the combine field is appended to the left of the first address packet, before the Op
Code field. This implies that s(MMaddr) + s(op) + s(stnum) + s(WBaddr) ≤ PacketSize.

' The MM# (PE#) is really the address of the memory module (processing element) with the bits in reverse
order (i.e. most significant bit rightmost). For example, at stage one we would route according to the
rightmost bit of the MM#.

7.3.   Current Design

The current design parameters are

s(op)  =  2

s(baddr)  =  21 

s(MMaddr)  =  6 

s(stnum)  =  3 

s(WBaddr)  =  4 

s(data)  =  32 

This corresponds to a 64 PE machine with 64 MM's, each with one megabyte, and 16
locations in each Wait_Buffer. Memory is organized in 32-bit words, but can be accessed in
words, half-words or bytes. The last three bits of the address encode the address mode.
Since messages are combined only if their address fields match, only messages that access the
same part of the same word are combined.

The  current  Op  Codes  are  as  follows: 

00  Not  Used 

01  LOAD 

10  STORE 

11  FETCH&ADD 

A stage number of 111B in the combine field indicates a noncombined message.

Two packets are used for address, and two packets are used for data. If no combine
field is used (associative search in the Wait_Buffer) then the packet size is 16. This leaves
one free bit in an address packet. If combine fields are used, a packet size of 18 is required,
with no extra free bits. The possible addition of one parity bit per packet is contemplated.

If the design that uses associative search in the Wait_Buffer is used then a
contemplated option is to forsake combining for STORE messages. This feature is probably not
useful, and as a result of forsaking it, uncombining will occur only on messages of length
four. This requires a slight change in the Combine_Queue.
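
The correspondence between the field sizes listed above and the machine they describe can be checked with a short Python sketch. The variable names are illustrative. The split of s(baddr) into 18 word-address bits plus 3 mode bits is a reading of the statement that the last three address bits encode the address mode, not something the text spells out.

```python
# Field sizes in bits, taken from the current design list.
S_OP, S_BADDR, S_MMADDR, S_STNUM, S_WBADDR, S_DATA = 2, 21, 6, 3, 4, 32
MODE_BITS = 3          # assumed: low address bits selecting word/half-word/byte
ADDRESS_PACKETS = 2    # two address packets per message

memory_modules = 2 ** S_MMADDR        # number of MMs (and PEs)
wait_buffer_slots = 2 ** S_WBADDR     # locations in each Wait_Buffer
bytes_per_module = 2 ** (S_BADDR - MODE_BITS) * (S_DATA // 8)

# With a combine field, all address-side fields share the address packets.
address_bits_with_cf = S_OP + S_BADDR + S_MMADDR + S_STNUM + S_WBADDR
packet_size_with_cf = -(-address_bits_with_cf // ADDRESS_PACKETS)  # ceiling

# Fields that must fit in the first address packet when a combine field is used.
first_packet_fixed_bits = S_MMADDR + S_OP + S_STNUM + S_WBADDR

print(memory_modules, wait_buffer_slots, bytes_per_module)
print(address_bits_with_cf, packet_size_with_cf, first_packet_fixed_bits)
```

With a combine field the address side needs 36 bits, which fills two 18 bit packets exactly, in agreement with the statement that no free bits remain in that case.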

7.3.1.   Chip control

We assume that each chip of the switch is externally driven by two nonoverlapping
signals of a two-phase clock. A chip may be in one of four states:

1.  run 

2.  hold 

3.  reset 

4.  shift 

State 1 (run) is the normal working state; state 2 (hold) freezes the values of all the internal
registers. State 3 (reset) is used to initialize the internal registers, and state 4 (shift) is used
to dump out the content of the internal registers. The chip state is controlled by the two
inputs Reset and Init/Error.

All the internal registers that have to be initialized, or dumped in a shift state, are
connected in a shift line, with a Shift_in input and a Shift_out output.

7.3.2.  Packaging

We have considered partitionings of an Ultracomputer node into one chip, two chips,
four chips, and nine chips. Due to pin limitations, we have never considered packaging
multiple nodes onto a single chip, although this will eventually become feasible. Furthermore,
we have not tried to bit slice across nodes in order to reduce the complexity of the first chip
set.

Presented here are two partitions which are viable under currently available packaging
facilities. Both of these partitions enable packet widths of 16 bits (18 bits), which was felt to
be desirable for performance reasons (network congestion is related to the number of
packets, not the number of messages). It should be noted, however, that the design will support
packet widths of any size.

The first design presented is for the 9 chip node. A schematic is given below, as well
as the equations for the external arbitration logic.

Figure 4. The above diagram is for the nine chip version of the Ultracomputer node. The arbiter chip
is not shown.


Associated with every partitioning of the chip is an off chip arbitration mechanism, that
is, an extra arbiter chip. While this chip is very small in comparison to the switch and merge
chips, it is not located on one of the other chips because it is pin intensive. In the above
design, the In, Out, Ro, and Ri control lines must all be connected to the arbiter. Since there
are 2 control lines for each of the ports and 2 ports for each of the sides, there are already
over 16 pins just for data. Below are the equations for the arbitration that must be
performed for this chip.


1.  node.In_i.Data_Accept     :=  min switch_j.In_i.Data_Accept

2.  switch_j.In_i.Valid_Data  :=  node.In_i.Valid_Data and node.In_i.Data_Accept

3.  node.Ro_i.Valid_Data      :=  max wait_j.Ro_i.Valid_Data

4.  wait_j.Ro_i.Data_Accept   :=  node.Ro_i.Data_Accept
                                  and turn(wait_j, wait_j.Ro_i.Valid_Data)

5.  node.Ri_i.Data_Accept     :=  min wait_j.Ri_i.Data_Accept

6.  wait_j.Ri_i.Valid_Data    :=  node.Ri_i.Valid_Data and node.Ri_i.Data_Accept

7.  node.Buff_i.Data_Accept   :=  min wait_j.Buff_i.Data_Accept

8.  wait_j.Buff_i.Valid_Data  :=  switch_j.Buff_i.Valid_Data
                                  and node.Buff_i.Data_Accept

Arbitration Equations.

Below is a table of the pin counts for the 9 and 5 chip partitionings, assuming 16 bit wide
data packets with no combine field, or 18 bits with a combine field.

Pin Counts

                          No CF                With CF
Pin                   5 chip   9 chip      5 chip   9 chip

Vdd                      1        1           1        1
Gnd                      1        1           1        1
Clock Phi1               1        1           1        1
Clock Phi2               1        1           1        1
Reset                    1        1           1        1
Init/Error               1        1           1        1
Shift_in                 1        1           1        1
Shift_out                1        1           1        1

WB address               -        -           4        4

Packets per Chip         4        3           4        3

Data bits               16       16          18       18
ValidData                1        1           1        1
DataAccept               1        1           1        1

Total port pins         72       54          80       60

Total pins used         78       60          90       70

Table 1. Minimum number of pins per chip


Ro lines must be tri-state, with the tri-state controlled by DataAccept.